Abstract. For q-ary n-sequences, we develop the concept of
similarity functions that can be used (for q = 4) to model a
thermodynamic similarity on DNA sequences. A similarity function is
identified by the length of a longest common subsequence between
two q-ary n-sequences. Codes based on similarity functions are
called DNA codes [10]. DNA codes are important components in
biomolecular computing and other biotechnical applications that
employ DNA hybridization assays. We present our unpublished
results [8] connected with the conventional deletion similarity
function [1] used in the theory of error-correcting codes. The
main aim of this paper is to obtain lower bounds on the rate of
optimal DNA codes for a biologically motivated [11], [12], [13]
similarity function called a similarity of blocks. We also present
constructions of suboptimal DNA codes based on the parity-check
code detecting one error in the Hamming metric [3].

I.  Introduction  and  Biological  Motivation 

Single strands of DNA are, abstractly, (A, C, G, T)-quaternary
sequences, with the four letters denoting the respective nucleic
acids. Strands of DNA sequence are oriented; for instance,
X = AACG is distinct from Y = GCAA. Furthermore, DNA is ordinarily
double stranded: each sequence X, or strand, occurs with its
reverse complement X', with reversal denoting that the sequences
of the two strands are oppositely oriented, relative to one
another, and with complementarity denoting that the allowed
pairings of letters, opposing one another on the two strands, are
(A, T) or (C, G), the canonical Watson-Crick pairings. For
instance, the two sequences X = AACG and X' = CGTT are reverse
complements of one another. Obviously, for any strand X, we
have (X')' = X.
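The reverse complement operation described above can be sketched in a few lines of Python. This is only an illustration: the names `WC_PAIR` and `reverse_complement` are our own and do not come from the paper.

```python
# Watson-Crick pairing of the four letters (A with T, C with G).
WC_PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand: str) -> str:
    """Read the strand in the opposite orientation and complement each letter."""
    return "".join(WC_PAIR[base] for base in reversed(strand))


if __name__ == "__main__":
    x = "AACG"
    print(reverse_complement(x))                      # CGTT
    print(reverse_complement(reverse_complement(x)))  # AACG
```

Applying the function twice returns the original strand, which is the identity (X')' = X stated in the text.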

Whenever two, not necessarily complementary, oppositely
directed DNA strands "mirror" one another, they are capable
of coalescing into a DNA duplex. The process of forming
DNA duplexes from single strands is referred to as DNA
hybridization. The greatest energy of DNA hybridization (the
greatest stability of a DNA duplex) is obtained when the two
sequences are reverse complements of one another and the DNA
duplex formed is a Watson-Crick (WC) duplex. However, there
are many instances when the formation of non-WC duplexes
is energetically favorable. The energy of DNA hybridization
(the stability of a DNA duplex) $\mathcal{E}(X, Y)$ of two single DNA
strands X and Y is, to a first approximation, measured by
the longest length of a common subsequence (not necessarily
contiguous) of either strand and the reverse complement of the
other [10]. For two reverse complementary strands X and X'
of length n, this measure plainly equals their length n, i.e.,
the maximum number of Watson-Crick bonds (complementary
letter pairs) which may be formed between two oppositely
oriented strands:

$$\mathcal{E}(X, X') = \max_{Y} \mathcal{E}(X, Y') = \max_{Y} \mathcal{E}(Y', X) = \mathcal{E}(X', X) = n. \tag{1.1}$$

For instance, if X = AACG and X' = CGTT, then
$\mathcal{E}(X, X') = 4$.

A DNA code $\mathbf{X}$ is a collection of single stranded DNA
sequences of fixed length n where each strand occurs with
its reverse complement and no strand in the code equals its
reverse complement [8], [10], i.e., if $X \in \mathbf{X}$, then $X' \in \mathbf{X}$
and $X' \ne X$. In DNA hybridization assays, the general
rule is that formation of WC duplexes is good, but the
formation of non-WC duplexes is bad. A primary goal of
DNA code design is to be assured that a fixed temperature can
be found that is well above the melting point of all non-WC
duplexes and well below the melting point of all WC duplexes
that can form from strands in the code. Thus the formation
of any WC duplex must be significantly more energetically
favorable than all possible non-WC duplexes. DNA codes are
important components for biomolecular computing [5] and
other biotechnical applications that employ DNA hybridization
assays. Note [10] that for these applications, the code length
n,  10  <  n  <  40,  is  experimentally  accessible  and  that  codes 
with  more  than  109  codewords  could  soon  be  called  for. 

The mathematical analysis of DNA hybridization is based
on the concept of similarity functions that can be used to
model a thermodynamic similarity on single stranded DNA
sequences. For two quaternary n-sequences X and Y, the
longest length of a sequence occurring as a (not necessarily
contiguous) subsequence of both is called a deletion similarity
$S_\lambda(X, Y)$ between X and Y. We supposed [8], [10] that the
deletion similarity $S_\lambda(X, Y)$ identifies the number of base pair
bonds in a hybridization assay between X and the reverse
complement of Y, i.e., the energy of DNA hybridization
$\mathcal{E}(X, Y')$ satisfying (1.1) is defined as follows:

$$\mathcal{E}(X, Y') = \mathcal{E}(X', Y) = S_\lambda(X, Y) = S_\lambda(Y, X). \tag{1.2}$$

Let D, $1 \le D \le n - 1$, be a fixed integer. A DNA code $\mathbf{X}$ is
called a DNA code of distance D based on deletion similarity
or, briefly, an (n, D)-code [8], [10] if the deletion similarity

$$S_\lambda(X, Y) \le n - D - 1, \quad X, Y \in \mathbf{X}, \quad Y \ne X. \tag{1.3}$$

Definition (1.2) and condition (1.3) mean that the energy of
DNA hybridization

$$\mathcal{E}(X, Y') \le n - D - 1, \quad X, Y \in \mathbf{X}, \quad Y \ne X, \tag{1.4}$$

i.e., in an (n, D)-code any strand X and the reverse complement
of any other strand Y can never form $\ge n - D$ base pair
bonds in a hybridization assay. In the theory of error-correcting
codes, condition (1.3), by itself, specifies codes capable of
correcting any combination of D deletions [1], [4].

Example 1.1. The DNA code $\mathbf{X} = \{X, X', Y, Y'\}$, where

X = ACAT, X' = ATGT,

Y = ATAC, Y' = GTAT,  (1.5)

is an (n, D)-code of length n = 4 and distance D = 1 because
n - D - 1 = 2 and the sequence Z = AT of length 2 is a
longest common subsequence between any pair of strands in
DNA code $\mathbf{X}$. Hence,

$$\mathcal{E}(X, X) = \mathcal{E}(X', X') = S_\lambda(X, X') = 2,$$

$$\mathcal{E}(Y, Y) = \mathcal{E}(Y', Y') = S_\lambda(Y, Y') = 2,$$

$$\mathcal{E}(X, Y) = \mathcal{E}(X', Y') = S_\lambda(X, Y') = 2,$$

$$\mathcal{E}(X, Y') = \mathcal{E}(X', Y) = S_\lambda(X, Y) = 2.$$
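The claims of Example 1.1 can be checked mechanically. The following minimal sketch computes the deletion similarity with the textbook dynamic program for the longest common subsequence; the function name `deletion_similarity` and the variable names are illustrative choices, not notation from the paper.

```python
from itertools import combinations


def deletion_similarity(x: str, y: str) -> int:
    """Length of a longest common (not necessarily contiguous) subsequence."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, start=1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


code = ["ACAT", "ATGT", "ATAC", "GTAT"]  # X, X', Y, Y' from (1.5)
n, D = 4, 1

# Deletion similarity of every unordered pair of distinct strands.
pairwise = {(x, y): deletion_similarity(x, y) for x, y in combinations(code, 2)}

# Condition (1.3): every pairwise similarity is at most n - D - 1.
satisfies_1_3 = all(s <= n - D - 1 for s in pairwise.values())

if __name__ == "__main__":
    for pair, s in pairwise.items():
        print(pair, s)
    print("condition (1.3) holds:", satisfies_1_3)
```

For longer codes the same check costs one quadratic-time dynamic program per pair of strands.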

In papers [11], [12], [13], we introduced the concept of a
common block subsequence, namely: a common subsequence
Z of sequences X and Y is called a common block
subsequence if any two consecutive elements of Z which are
consecutive in X are also consecutive in Y and vice versa. For
two quaternary n-sequences X and Y, the longest length of a
sequence occurring as a common block subsequence of both
is called a block similarity between X and Y. For example,
the sequence Z = AT of length 2 is a longest common block
subsequence between any pair of strands in DNA code (1.5).
Thus, DNA code (1.5) can be considered as a DNA (4, 1)-code
based on block similarity.
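One way to compute the block similarity is a dynamic program over matched positions; the sketch below is our own reading of the definition and is not taken from the paper. It assumes that two consecutive elements of Z must be either adjacent in both sequences or non-adjacent in both. The names `block_similarity`, `M` and `G` are hypothetical.

```python
from itertools import combinations


def block_similarity(x: str, y: str) -> int:
    """Longest common subsequence in which adjacency of matched letters agrees in x and y."""
    n, m = len(x), len(y)
    # M[i][j]: best length of a common block subsequence ending with x[i-1] matched to y[j-1].
    M = [[0] * (m + 1) for _ in range(n + 1)]
    # G[i][j]: maximum of M over the prefix rectangle [0..i] x [0..j].
    G = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                adjacent = M[i - 1][j - 1]  # previous match is adjacent in both sequences
                separated = G[i - 2][j - 2] if i >= 2 and j >= 2 else 0  # gap in both
                M[i][j] = 1 + max(adjacent, separated)
            G[i][j] = max(M[i][j], G[i - 1][j], G[i][j - 1])
    return G[n][m]


code = ["ACAT", "ATGT", "ATAC", "GTAT"]  # DNA code (1.5)
block_pairs = {(x, y): block_similarity(x, y) for x, y in combinations(code, 2)}

if __name__ == "__main__":
    print(block_pairs)
    # AB is a common subsequence of "AB" and "AXB", but not a common block subsequence.
    print(block_similarity("AB", "AXB"))
```

The last call illustrates the difference from the deletion similarity: the letters A and B are adjacent in the first sequence and separated in the second, so only a single letter can be kept.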

The first conventional issue of coding theory [3] for DNA
codes is to obtain a lower random coding bound on the rate
of DNA codes and, hence, to identify values of the distance
fraction D/n for which the DNA code size grows exponentially
as n increases. This problem is more difficult than
the corresponding problem for deletion-correcting codes. For
instance, we cannot apply the best known random coding
bounds [6] on the rate of deletion-correcting codes because
these bounds were proved for codes which are not invariant
under the reverse complement transformation. For the deletion
similarity, the best known random coding bounds on the rate
of DNA codes were established in our papers [8], [10]. The
second conventional issue of coding theory for DNA codes
is to present constructions of DNA codes. The aim of our
paper is to obtain bounds and constructions for DNA codes
based on the deletion and block similarities, which have a good
biological motivation to model a thermodynamic similarity on
DNA sequences [11], [12], [13]. We will study q-ary DNA
codes, which can be considered as an evident generalization of
quaternary DNA codes.

II.  Notations,  Definitions  and  Results 

The symbol $\triangleq$ denotes definitional equalities and the symbol
$[n] \triangleq \{1, 2, \ldots, n\}$ denotes the set of integers from 1 to n. Let
q = 2, 4, ... be a fixed even integer, $A \triangleq \{0, 1, \ldots, q - 1\}$ be
the standard alphabet of size |A| = q and $\lfloor u \rfloor$ ($\lceil u \rceil$) denote the
largest (smallest) integer $\le u$ ($\ge u$). Consider two arbitrary
q-ary n-sequences

$$\mathbf{x} = (x_1, x_2, \ldots, x_n) \in A^n, \qquad \mathbf{y} = (y_1, y_2, \ldots, y_n) \in A^n.$$

In what follows, we will denote by the symbol $S = S(\mathbf{x}, \mathbf{y})$ an
arbitrary symmetric function satisfying the conditions

$$0 \le S(\mathbf{x}, \mathbf{y}) = S(\mathbf{y}, \mathbf{x}) \le S(\mathbf{x}, \mathbf{x}) = n, \quad \mathbf{x} \in A^n, \quad \mathbf{y} \in A^n,$$

and called [10] a similarity function. Introduce the binary
entropy function

$$h_q(u) \triangleq -u \log_q u - (1 - u) \log_q (1 - u), \quad 0 < u < 1.$$

Let $\ell \in [n]$ and $m = 1, 2, \ldots, \ell$. By the symbol

$$\mathbf{z} = (z_1, z_2, \ldots, z_\ell) \in A^\ell, \quad \text{where } z_m = x_{i_m} = y_{j_m},$$

$$1 \le i_1 < i_2 < \cdots < i_\ell \le n, \quad 1 \le j_1 < j_2 < \cdots < j_\ell \le n,$$

we will denote a common subsequence of length $|\mathbf{z}| = \ell$
between $\mathbf{x}$ and $\mathbf{y}$.

Definition 1. [1]. Let $S_\lambda(\mathbf{x}, \mathbf{y})$, $0 \le S_\lambda(\mathbf{x}, \mathbf{y}) \le n$, denote
the length $|\mathbf{z}|$ of a longest common subsequence $\mathbf{z}$ between
sequences $\mathbf{x}$ and $\mathbf{y}$. The number $S_\lambda(\mathbf{x}, \mathbf{y})$ is called a deletion
similarity between $\mathbf{x}$ and $\mathbf{y}$.

Definition 2. [11], [12], [13]. A common subsequence

$$\mathbf{z} = (z_1, z_2, \ldots, z_\ell), \quad 2 \le \ell \le n,$$

is called a common block subsequence of length $|\mathbf{z}| = \ell$
between $\mathbf{x}$ and $\mathbf{y}$ if any two consecutive elements $z_m, z_{m+1}$,
$m = 1, 2, \ldots, \ell - 1$, which are consecutive (separated) in $\mathbf{x}$
are also consecutive (separated) in $\mathbf{y}$ and vice versa, i.e.,

$$(z_m = x_{i_m},\ z_{m+1} = x_{i_m + 1}) \iff (z_m = y_{j_m},\ z_{m+1} = y_{j_m + 1}).$$

Definition 3. [11], [12], [13]. Let $S_\beta(\mathbf{x}, \mathbf{y})$ denote the
length $|\mathbf{z}|$ of a longest sequence occurring as a common block
subsequence $\mathbf{z}$ between sequences $\mathbf{x}$ and $\mathbf{y}$. The number
$S_\beta(\mathbf{x}, \mathbf{y})$, $0 \le S_\beta(\mathbf{x}, \mathbf{y}) \le n$, is called a similarity of blocks
between $\mathbf{x}$ and $\mathbf{y}$. Obviously, $S_\beta(\mathbf{x}, \mathbf{y}) \le S_\lambda(\mathbf{x}, \mathbf{y})$.

Definition 4. [8], [10]. If q = 2, 4, ..., then

$$\bar{x} \triangleq (q - 1) - x, \quad x \in A = \{0, 1, \ldots, q - 1\},$$

is called a complement of a letter x. For an arbitrary q-ary
n-sequence $\mathbf{x} = (x_1, x_2, \ldots, x_{n-1}, x_n) \in A^n$, we define its
reverse complement $\tilde{\mathbf{x}} \triangleq (\bar{x}_n, \bar{x}_{n-1}, \ldots, \bar{x}_2, \bar{x}_1) \in A^n$.

Let $\mathbf{x}(1), \mathbf{x}(2), \ldots, \mathbf{x}(N)$, where

$$\mathbf{x}(k) \triangleq (x_1(k), x_2(k), \ldots, x_n(k)), \quad x_j(k) \in A, \quad k \in [N],$$

be codewords of a q-ary code $\mathbf{X} = \{\mathbf{x}(1), \mathbf{x}(2), \ldots, \mathbf{x}(N)\}$
of length n and even size N. Let D, $1 \le D \le n - 1$, be an
arbitrary integer.

Definition 5. [8], [10]. A code $\mathbf{X}$ is called a DNA
(n, D)-code based on similarity function $S = S(\mathbf{x}, \mathbf{y})$ (briefly,
(n, D)-code) if the following two conditions are fulfilled.
(i) For any number $k \in [N]$ there exists $k' \in [N]$, $k' \ne k$, such
that $\mathbf{x}(k') = \tilde{\mathbf{x}}(k)$. (ii) For any $k, k' \in [N]$, where $k \ne k'$,
the similarity $S(\mathbf{x}(k), \mathbf{x}(k')) \le n - D - 1$. We will also say
that code $\mathbf{X}$ is a DNA code of length n, distance D and
similarity n - D - 1.
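Conditions (i) and (ii) of Definition 5 translate directly into a membership test. The sketch below is illustrative only: the helper names (`is_dna_code`, `lcs_length`, `parse`) are hypothetical, and the deletion similarity is used as the default similarity function. The binary words are those that appear in Example 2.3 of this paper.

```python
from itertools import combinations


def reverse_complement(x, q):
    """Reverse the word and replace each letter a by its complement (q - 1) - a."""
    return tuple(q - 1 - a for a in reversed(x))


def lcs_length(x, y):
    """Deletion similarity: length of a longest common subsequence."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, start=1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def is_dna_code(code, q, D, similarity=lcs_length):
    """Check conditions (i) and (ii) of Definition 5 for codewords of equal length."""
    words = set(code)
    n = len(next(iter(words)))
    for x in words:
        partner = reverse_complement(x, q)
        if partner == x or partner not in words:
            return False  # condition (i) fails
    # Condition (ii): every pair of distinct codewords has similarity at most n - D - 1.
    return all(similarity(x, y) <= n - D - 1 for x, y in combinations(words, 2))


def parse(word):
    """Turn a digit string such as '0110' into a tuple of integer letters."""
    return tuple(int(c) for c in word)


binary_code = [parse(w) for w in ("0000", "1111", "0110", "1001")]

if __name__ == "__main__":
    print(is_dna_code(binary_code, q=2, D=1))
```

Passing a different `similarity` callable (for instance a block similarity routine) checks the same two conditions for another similarity function.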

For q = 4, Definition 5 and a biological motivation of
(n, D)-codes based on deletion similarity $S = S_\lambda(\mathbf{x}, \mathbf{y})$ were
suggested in [10]. If only condition (ii) is retained, then an
(n, D)-code based on deletion similarity is a code of length
n capable of correcting any combination of $\le D$ deletions [1].
A biological motivation of quaternary DNA codes based on
similarity of blocks $S = S_\beta(\mathbf{x}, \mathbf{y})$ was suggested in [11].

For given n and D, we denote by $N_q(n, D)$ the maximal
size of (n, D)-codes. If d, 0 < d < 1, is a fixed number, then

$$R_q(d) \triangleq \lim_{n \to \infty} \frac{\log_q N_q(n, \lfloor dn \rfloor)}{n}$$

is called a rate of $(n, \lfloor dn \rfloor)$-codes.

Let $d = d_q^\lambda$, $0 < d_q^\lambda < (q - 1)/q$, be the unique root of the
equation $\frac{1 + d}{2} = d \log_q(q - 1) + h_q(d)$. A lower bound on the
rate $R_q^\lambda(d)$ of DNA codes based on the deletion similarity is
presented by

Theorem 1. [8]. If $0 < d < d_q^\lambda$, then

$$R_q^\lambda(d) \ge \underline{R}_q^\lambda(d) \triangleq 1 + d - 2\,[d \log_q(q - 1) + h_q(d)].$$

Example 2.1. For the binary case, $d_2^\lambda = 0.13340$ and for
the most important quaternary case, $d_4^\lambda = 0.27029$. In addition,
$d_6^\lambda = 0.34902$ and $d_8^\lambda = 0.40324$.
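The critical points quoted in Example 2.1 can be reproduced numerically by locating the zero of the lower bound of Theorem 1. The sketch below uses plain bisection on the interval (0, (q - 1)/q); it is an illustration under that assumption, and the function names are hypothetical.

```python
import math


def h(u: float, q: int) -> float:
    """Binary entropy function with logarithms to base q (0 at the endpoints)."""
    if u <= 0.0 or u >= 1.0:
        return 0.0
    return -u * math.log(u, q) - (1.0 - u) * math.log(1.0 - u, q)


def deletion_rate_bound(d: float, q: int) -> float:
    """Lower bound of Theorem 1: 1 + d - 2 [d log_q(q - 1) + h_q(d)]."""
    return 1.0 + d - 2.0 * (d * math.log(q - 1, q) + h(d, q))


def deletion_critical_point(q: int, tol: float = 1e-12) -> float:
    """Bisection for the root of the bound on (0, (q - 1)/q)."""
    lo, hi = 0.0, (q - 1) / q  # the bound is positive at lo and negative at hi
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if deletion_rate_bound(mid, q) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    for q in (2, 4, 6, 8):
        print(q, round(deletion_critical_point(q), 5))
```

Bisection is sufficient here because the bound equals 1 at d = 0, equals -1/q at d = (q - 1)/q, and is convex in between, so it changes sign exactly once.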

Theorem 2. For any distance fraction d, $0 < d \le 1/2$, the
rate $R_q^\beta(d)$ of DNA codes based on the similarity of blocks
satisfies the inequality

$$R_q^\beta(d) \ge \underline{R}_q^\beta(d) \triangleq (1 - d) - E_q(d), \tag{2.1}$$

$$E_q(d) \triangleq \max_{0 \le v \le d} F_q(v, d), \tag{2.2}$$

$$F_q(v, d) \triangleq (1 - d)\, h_q\!\left(\frac{v}{1 - d}\right) + 2d\, h_q\!\left(\frac{v}{d}\right). \tag{2.3}$$

Theorems 1 and 2 are established with the help of a random
coding bound described in Sect. 3. The proof of Theorem 2
will be given in Sect. 4. The proof of Theorem 1 will be given
in Sect. 5.

Let a number $d_q^\beta$, $0 < d_q^\beta \le 1/2$, be the unique root of the
equation $\underline{R}_q^\beta(d) = 0$ or $1 - d = E_q(d)$. Obviously, the lower
bound $\underline{R}_q^\beta(d) > 0$ if $0 < d < d_q^\beta$ and we will say that $d_q^\beta$ is a
critical point of the lower bound $\underline{R}_q^\beta(d)$.

Example 2.2. We calculated $d_2^\beta = 0.17888$, $d_4^\beta = 0.35755$,
$d_6^\beta = 0.44523$ and $d_8^\beta = 1/2$. It means that the
critical points for block similarity exceed the corresponding
critical points (see Example 2.1) for deletion similarity.
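A numerical sketch of how the critical points for block similarity can be evaluated is given below. It assumes that the exponent has the form $F_q(v, d) = (1 - d) h_q(v/(1 - d)) + 2d\, h_q(v/d)$ with $0 \le v \le \min(d, 1 - d)$, maximizes it over v by ternary search (the function is concave in v), and then bisects on d. All function names are our own.

```python
import math


def h(u: float, q: int) -> float:
    """Binary entropy function with logarithms to base q (0 at the endpoints)."""
    if u <= 0.0 or u >= 1.0:
        return 0.0
    return -u * math.log(u, q) - (1.0 - u) * math.log(1.0 - u, q)


def F(v: float, d: float, q: int) -> float:
    """Assumed form of the exponent F_q(v, d) in (2.3)."""
    return (1.0 - d) * h(v / (1.0 - d), q) + 2.0 * d * h(v / d, q)


def E(d: float, q: int, iters: int = 200) -> float:
    """Maximum of F_q(v, d) over v by ternary search on [0, min(d, 1 - d)]."""
    lo, hi = 0.0, min(d, 1.0 - d)
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if F(m1, d, q) < F(m2, d, q):
            lo = m1
        else:
            hi = m2
    return F((lo + hi) / 2, d, q)


def block_critical_point(q: int, tol: float = 1e-10) -> float:
    """Bisection for the root of (1 - d) - E_q(d) on (0, 1/2)."""
    lo, hi = 1e-9, 0.5
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (1.0 - mid) - E(mid, q) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    for q in (2, 4, 6):
        print(q, round(block_critical_point(q), 5))
```

For q = 8 the bound vanishes exactly at d = 1/2 under this form of $F_q$, since the maximum is attained at v = 1/4 and equals $1.5 / \log_2 8 = 1/2$.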

One can easily understand that the conventional Hamming
bound on the size of block codes with distance D + 1 is a trivial
upper bound on $N_q^\beta(n, D)$. For D = 1, an improvement of
this trivial bound is given by

Theorem 3. The maximal size $N_q^\beta(n, 1) \le (q^{n-1} + q)/2$.

Proof. Consider an arbitrary code $\mathbf{X} = \{\mathbf{x}(k), k \in [N]\}$ of
length n, distance D = 1 and block similarity n - 2. For each
$\mathbf{x}(k)$, there exist one or two subsequences of length n - 1
obtained by deleting the first or the last element of $\mathbf{x}(k)$.
Let $\mathbf{X}$ contain $N_1$ ($N_2$) codewords which yield one (two)
subsequences of length n - 1. A codeword yields a single
subsequence only if all of its letters coincide, so, obviously,
$N_1 \le q$. From item (ii) of Definition 5, it follows that there
are $N_1 + 2N_2$ distinct (n - 1)-subsequences, i.e.,
$N_1 + 2N_2 \le q^{n-1}$. Therefore,

$$N = N_1 + N_2 = \tfrac{1}{2}\,[N_1 + (N_1 + 2N_2)] \le (q^{n-1} + q)/2.$$

Theorem 3 is proved.

The following theorem is based on a construction of q-ary
DNA codes obtained with the help of q-ary parity-check codes
detecting one error in the Hamming metric [3].

Theorem 4. There exists a q-ary DNA code of length n,
distance D = 1, block similarity n - D - 1 = n - 2 and size

• $N = (q^{n-1} + q)/2$ if $n = q = 2^m$;

• $N = q^{n-1}/2$ if $q = 2^m$, $n = 2^{m+k}$, $k \ge 1$;

•  TV  =  ^g"-1  —  q  /2  if  n  is  a  number  divisible 

by  q  and  4; 

•  TV  =  ^g"-1  —  g"/2  —  q  q-^X  j  /2  if  n  is  a  number 
divisible  by  q. 

If $n = q = 2^m$, then Theorem 3 means that the construction
of Theorem 4 is optimal. If q is fixed and $n \to \infty$, then
Theorem 3 means that the construction of Theorem 4 is
asymptotically optimal.

Example 2.3. If q = 2 and n = 4, then a DNA code
of length n = 4, size N = 4, distance D = 1 and block
(deletion) similarity n - D - 1 = 2 contains 2 pairs of
codewords: 0000, 1111 and 0110, 1001. Obviously,
$N_2^\beta(4, 1) = N_2^\lambda(4, 1) = 4$.

Example 2.4. For n = q = 4, the construction of an optimal
DNA code from Theorem 4 is illustrated by the following
table which contains $4^3 = 64$ codewords satisfying the
parity-check condition: for each codeword, the sum of its elements
is a number divisible by 4. These codewords are written as
$\frac{1}{2} \cdot 4^3 = 32$ pairs of reverse complement codewords. Any row
of the table consists of 1, 2, or 4 pairs. In any row, the first
(second) codewords are obtained as consecutive cyclic shifts
of the first (second) codeword of any fixed pair of the row. If
we eliminate from the table all 15 pairs from the second and
fourth columns of the table, then one can easily check that
the remaining 17 pairs constitute a quaternary DNA code $\mathbf{X}$ of
length n = 4, size N = 2 · 17 = 34, block distance D = 1
and block similarity n - D - 1 = 2.

0000,3333
0013,0233    3001,2330    1300,3302    0130,3023
0022,1133    2002,1331    2200,3311    0220,3113
0031,2033    1003,0332    3100,3320    0310,3203
0103,0323    3010,3230    0301,2303    1030,3032
0112,1223    2011,2231    1201,2312    1120,3122
0121,2123    1012,1232    2101,2321    1210,3212
0202,1313    2020,3131
0211,2213    1021,2132    1102,1322    2110,3221
1111,2222
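The raw material of the table in Example 2.4 can be regenerated with a short script. The sketch below enumerates all quaternary words of length 4 whose letters sum to a multiple of 4 and groups them into reverse-complement pairs; it does not reproduce the row layout or the selection of columns. The variable names are illustrative.

```python
from itertools import product

q, n = 4, 4


def reverse_complement(x):
    """Reverse the word and replace each letter a by (q - 1) - a."""
    return tuple(q - 1 - a for a in reversed(x))


# All q-ary n-sequences satisfying the parity check: letter sum divisible by q.
words = [x for x in product(range(q), repeat=n) if sum(x) % q == 0]
word_set = set(words)

# The parity-check code is closed under the reverse complement transformation.
closed = all(reverse_complement(x) in word_set for x in words)

# Unordered pairs {x, reverse complement of x}; a self-paired word would give a singleton.
pairs = {frozenset((x, reverse_complement(x))) for x in words}

if __name__ == "__main__":
    print(len(words), len(pairs), closed)
    for pair in sorted(sorted(p) for p in pairs)[:5]:
        print(["".join(map(str, w)) for w in pair])
```

No word is its own reverse complement here: such a word would need letter sum n(q - 1)/2 = 6, which is not divisible by 4.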

We mark by underlining the pairs of codewords from code $\mathbf{X}$
(there are 10 such pairs) which have pairwise deletion
similarities $\le 2$. They constitute a quaternary DNA code of
length n = 4, size N = 2 · 10 = 20, deletion distance D = 1
and deletion similarity n - D - 1 = 2. This means that the
maximal size $N_4^\lambda(4, 1) \ge 20$. A general constructive lower
bound on $N_q^\lambda(n, 1)$ is given by

Theorem  5.  If  n  =  qk,  where  k  =  1,3,...  is  an  odd 
number,  then 

an~ 1 

Ng(n,  1)  >  q- - . 

1  n 

Theorem  5  is  based  on  a  construction  [4]  of  codes  correcting 
single  deletions  or  insertions. 


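III. Random Coding Bound for DNA Codes

For an arbitrary integer s, $0 \le s \le n$, we define two sets

$$\mathcal{P}(n, s) \triangleq \{(\mathbf{x}, \mathbf{y}) : S(\mathbf{x}, \mathbf{y}) = s\}$$

and

$$\hat{\mathcal{P}}(n, s) \triangleq \{\mathbf{x} : S(\mathbf{x}, \tilde{\mathbf{x}}) = s\},$$

i.e., the set of all pairs $(\mathbf{x}, \mathbf{y})$ for which the similarity
$S(\mathbf{x}, \mathbf{y}) = s$ and the set of all sequences $\mathbf{x}$ for which the
similarity between $\mathbf{x}$ and its reverse complement $\tilde{\mathbf{x}}$ is $S(\mathbf{x}, \tilde{\mathbf{x}}) = s$.
For a fixed parameter u, $0 \le u \le 1$, define the functions

$$p(u) \triangleq \lim_{n \to \infty} \frac{\log_q |\mathcal{P}(n, \lceil (1 - u)n \rceil)|}{n}$$

and

$$\hat{p}(u) \triangleq \lim_{n \to \infty} \frac{\log_q |\hat{\mathcal{P}}(n, \lceil (1 - u)n \rceil)|}{n}$$

satisfying the inequalities $0 \le p(u) \le 2$ and $0 \le \hat{p}(u) \le 1$. One
can obtain (the proof is omitted here) a random coding bound
on the rate $R_q(d)$ which is given by

Lemma 3.1. Let d, 0 < d < 1, be fixed. If

$$\min_{0 \le u \le d} \{1 - \hat{p}(u)\} > 0,$$

then the rate

$$R_q(d) \ge \min_{0 \le u \le d} \{2 - p(u)\}.$$


IV. Proof of Theorem 2

Let s, $1 \le s \le n$, be an arbitrary integer and
$\mathcal{P}^\beta(n, s)$, $\hat{\mathcal{P}}^\beta(n, s)$ denote the sets from Lemma 3.1 for
the similarity of blocks. For a fixed q-ary s-sequence
$\mathbf{z} = (z_1, z_2, \ldots, z_s)$, and $j = 1, 2, \ldots, \min\{s, n - s + 1\}$, we
introduce the concept of a j-block presentation of $\mathbf{z}$, i.e., a
partition of $\mathbf{z}$ into j nonempty blocks

$$\mathbf{z} = \{\mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_{j-1}, \mathbf{b}_j\}, \tag{4.1}$$

where each block contains consecutive elements of $\mathbf{z}$. Let
$\mathbf{x} = (x_1, x_2, \ldots, x_n) \in A^n$ be a fixed q-ary n-sequence. We
say that a block presentation $\mathbf{z}$ of the form (4.1) is a block
subsequence (BSS) of $\mathbf{x}$ if $\mathbf{z}$ is a subsequence of $\mathbf{x}$, i.e.,

$$\mathbf{z} = (x_{i_1}, x_{i_2}, \ldots, x_{i_{s-1}}, x_{i_s}), \quad 1 \le i_1 < i_2 < \cdots < i_{s-1} < i_s \le n,$$

and all blocks $\{\mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_{j-1}, \mathbf{b}_j\}$, each consisting of
consecutive elements of the sequence $\mathbf{x}$, are separated in $\mathbf{x}$. Obviously,
if a pair $(\mathbf{x}, \mathbf{y}) \in \mathcal{P}^\beta(n, s)$ (a sequence $\mathbf{x} \in \hat{\mathcal{P}}^\beta(n, s)$), then
there exists a block presentation $\mathbf{z}$ which is a common BSS
between $\mathbf{x}$ and $\mathbf{y}$ ($\mathbf{x}$ and $\tilde{\mathbf{x}}$), i.e., each of the sequences $\mathbf{x}$ and $\mathbf{y}$
($\mathbf{x}$ and $\tilde{\mathbf{x}}$) contains the separated blocks $\{\mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_{j-1}, \mathbf{b}_j\}$
consisting of their consecutive elements. The following upper
bound on the size $|\mathcal{P}^\beta(n, s)|$ is true.

Lemma 4.1. For any s, $1 \le s \le n$, the size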

$$|\mathcal{P}^\beta(n, s)| \le q^{s} \sum_{j=1}^{\min\{s,\, n-s+1\}} \binom{s-1}{j-1} \left[ \binom{n-s+1}{j}\, q^{n-s} \right]^2.$$

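The proof of Lemma 4.1 is omitted here. For a fixed q-ary
s-sequence $\mathbf{z} = (z_1, z_2, \ldots, z_s)$ and its j-block presentation
(4.1), we introduce a reverse complement j-block presentation

$$\tilde{\mathbf{z}} = \{\tilde{\mathbf{b}}_j, \tilde{\mathbf{b}}_{j-1}, \ldots, \tilde{\mathbf{b}}_2, \tilde{\mathbf{b}}_1\}, \quad j = 1, \ldots, \min\{s, n - s + 1\}.$$

Lemma 4.2. The set $\hat{\mathcal{P}}^\beta(n, s)$ is empty if $s \ge 1$ is odd.
If $s \ge 2$ is even and an n-sequence $\mathbf{x} \in \hat{\mathcal{P}}^\beta(n, s)$, then there
exist an integer j, $j = 1, 2, \ldots, \min\{s, n - s + 1\}$, and a
self-reverse complementary s-sequence $\mathbf{z} = \tilde{\mathbf{z}}$, $|\mathbf{z}| = s$, of the form
(4.1) which is a common block subsequence between $\mathbf{x}$ and $\tilde{\mathbf{x}}$
and $\mathbf{z}$ has a self-reverse complementary block presentation

$$\mathbf{z} = \{\mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_{j-1}, \mathbf{b}_j\} = \{\tilde{\mathbf{b}}_j, \tilde{\mathbf{b}}_{j-1}, \ldots, \tilde{\mathbf{b}}_2, \tilde{\mathbf{b}}_1\} = \tilde{\mathbf{z}},$$

i.e., block $\mathbf{b}_1 = \tilde{\mathbf{b}}_j$, block $\mathbf{b}_2 = \tilde{\mathbf{b}}_{j-1}$, ..., block $\mathbf{b}_{j-1} = \tilde{\mathbf{b}}_2$,
and block $\mathbf{b}_j = \tilde{\mathbf{b}}_1$.

The proof of Lemma 4.2 is omitted. Lemma 4.2 leads to

Lemma 4.3. For any even s, $s \in [n]$, the size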

$$|\hat{\mathcal{P}}^\beta(n, s)| \le q^{s/2} \sum_{j=1}^{\min\{s,\, n-s+1\}} \binom{s/2 - 1}{\lceil j/2 \rceil - 1} \binom{n-s+1}{j}\, q^{n-s}.$$



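For $s \in [n]$, consider the number

$$B(n, s) \triangleq \max_{1 \le j \le \min\{s,\, n-s+1\}} \binom{s-1}{j-1} \binom{n-s+1}{j}^2.$$

Let u, 0 < u < 1, be a fixed parameter. Introduce

$$E_q(u) \triangleq \lim_{n \to \infty} \frac{\log_q B(n, \lceil (1 - u)n \rceil)}{n}, \quad 0 < u < 1.$$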

Lemmas 4.1 and 4.3 yield upper bounds on the functions $p^\beta(u)$
and $\hat{p}^\beta(u)$ used in Lemma 3.1:

$$p^\beta(u) \le (1 + u) + E_q(u),$$

$$\hat{p}^\beta(u) \le \tfrac{1}{2}\,[(1 + u) + E_q(u)].$$

Therefore, Lemma 3.1 gives a random coding bound on
the rate $R_q^\beta(d)$ of q-ary DNA $(n, \lfloor dn \rfloor)$-codes based on the
similarity of blocks. One can easily check that the given lower
bound $\underline{R}_q^\beta(d)$ can be written in the form (2.1)-(2.3).

Theorem 2 is proved.


V.  Proof  of  Theorem  1 

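Let s, $0 \le s \le n$, be an arbitrary integer and

$$\mathcal{P}^\lambda(n, s) \triangleq \{(\mathbf{x}, \mathbf{y}) : S_\lambda(\mathbf{x}, \mathbf{y}) = s\},$$

$$\hat{\mathcal{P}}^\lambda(n, s) \triangleq \{\mathbf{x} : S_\lambda(\mathbf{x}, \tilde{\mathbf{x}}) = s\},$$

denote the sets from Lemma 3.1 for the deletion similarity.

Lemma 5.1. [2], [7]. Let n and s be integers, $0 \le s \le n$.
For an arbitrary q-ary s-sequence $\mathbf{y}$ denote by $B_q(\mathbf{y}, n)$ the
set of all q-ary n-sequences $\mathbf{x}$ that include $\mathbf{y}$ as a subsequence,
i.e., that can be obtained from $\mathbf{y}$ by n - s insertions. Then for
the fixed n and s, the size of $B_q(\mathbf{y}, n)$ does not depend on $\mathbf{y}$
and has the form

$$|B_q(\mathbf{y}, n)| = \sum_{k=0}^{n-s} \binom{n}{k} (q - 1)^k \triangleq B_q(n, s). \tag{5.1}$$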
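For small parameters, the statement of Lemma 5.1 can be confirmed by exhaustive enumeration. The sketch below counts, for every q-ary s-sequence y, the n-sequences that contain y as a subsequence and compares the count with the closed form $\sum_{k=0}^{n-s} \binom{n}{k}(q-1)^k$. The function names are illustrative, and the brute force is only feasible for very short sequences.

```python
import math
from itertools import product


def is_subsequence(y, x) -> bool:
    """True if y can be obtained from x by deleting letters."""
    it = iter(x)
    return all(letter in it for letter in y)


def count_supersequences(y, n: int, q: int) -> int:
    """Brute-force count of q-ary n-sequences that include y as a subsequence."""
    return sum(1 for x in product(range(q), repeat=n) if is_subsequence(y, x))


def B(q: int, n: int, s: int) -> int:
    """Closed form of Lemma 5.1: sum over k = 0..n-s of C(n, k) (q - 1)^k."""
    return sum(math.comb(n, k) * (q - 1) ** k for k in range(n - s + 1))


if __name__ == "__main__":
    q, n, s = 3, 4, 2
    counts = {y: count_supersequences(y, n, q) for y in product(range(q), repeat=s)}
    print(set(counts.values()), B(q, n, s))
```

The set of counts contains a single value, which illustrates that the size does not depend on the choice of y.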

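Lemma 5.2. The set $\hat{\mathcal{P}}^\lambda(n, s)$ is empty if s is odd. If s
is even and an n-sequence $\mathbf{x} \in \hat{\mathcal{P}}^\lambda(n, s)$, then there exists a
self-reverse complementary s-sequence $\mathbf{z} = \tilde{\mathbf{z}}$, $|\mathbf{z}| = s$, which
is a common subsequence between $\mathbf{x}$ and $\tilde{\mathbf{x}}$.

Lemma 5.2 is similar to Lemma 4.2. Lemmas 5.1 and 5.2
yield

$$|\mathcal{P}^\lambda(n, s)| \le q^{s} \cdot [B_q(n, s)]^2,$$

$$|\hat{\mathcal{P}}^\lambda(n, s)| \le q^{s/2} \cdot B_q(n, s), \quad 0 \le s \le n. \tag{5.2}$$

If u, $0 < u < (q - 1)/q$, is fixed, then from definition (5.1)
it follows

$$\lim_{n \to \infty} \frac{\log_q B_q(n, \lceil (1 - u)n \rceil)}{n} = u \log_q(q - 1) + h_q(u).$$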
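The limit above can be illustrated numerically by evaluating the normalized logarithm of $B_q(n, \lceil (1-u)n \rceil)$ for a moderately large n and comparing it with $u \log_q(q-1) + h_q(u)$. This is a sanity check under the closed form of (5.1), not a proof; the function names are our own.

```python
import math


def h(u: float, q: int) -> float:
    """Binary entropy function with logarithms to base q (0 at the endpoints)."""
    if u <= 0.0 or u >= 1.0:
        return 0.0
    return -u * math.log(u, q) - (1.0 - u) * math.log(1.0 - u, q)


def B(q: int, n: int, s: int) -> int:
    """Number of q-ary n-sequences containing a fixed s-sequence, see (5.1)."""
    return sum(math.comb(n, k) * (q - 1) ** k for k in range(n - s + 1))


def finite_exponent(q: int, n: int, u: float) -> float:
    """log_q B_q(n, ceil((1 - u) n)) / n for a finite length n."""
    s = math.ceil((1.0 - u) * n)
    return math.log(B(q, n, s), q) / n


def limit_exponent(q: int, u: float) -> float:
    """Claimed limit u log_q(q - 1) + h_q(u), valid for 0 < u < (q - 1)/q."""
    return u * math.log(q - 1, q) + h(u, q)


if __name__ == "__main__":
    q, u = 4, 0.25
    for n in (200, 2000):
        print(n, finite_exponent(q, n, u), limit_exponent(q, u))
```

The finite-length values approach the limit from below, with a gap of order (log n)/n.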


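Therefore, applying (5.2), we have

$$p^\lambda(u) \triangleq \lim_{n \to \infty} \frac{\log_q |\mathcal{P}^\lambda(n, \lceil (1 - u)n \rceil)|}{n} \le 1 - u + 2u \log_q(q - 1) + 2 h_q(u), \tag{5.3}$$

$$\hat{p}^\lambda(u) \triangleq \lim_{n \to \infty} \frac{\log_q |\hat{\mathcal{P}}^\lambda(n, \lceil (1 - u)n \rceil)|}{n} \le \frac{1}{2}\,[1 - u + 2u \log_q(q - 1) + 2 h_q(u)], \tag{5.4}$$

provided that $0 < u < (q - 1)/q$. Hence, if $0 < d < (q - 1)/q$,
then from (5.3)-(5.4) it follows

$$\min_{0 \le u \le d} \{2 - p^\lambda(u)\} \ge 1 + d - 2d \log_q(q - 1) - 2 h_q(d) = \underline{R}_q^\lambda(d), \tag{5.5}$$

$$\min_{0 \le u \le d} \{1 - \hat{p}^\lambda(u)\} \ge \frac{1}{2} \cdot \underline{R}_q^\lambda(d). \tag{5.6}$$

Inequalities (5.5)-(5.6) and Lemma 3.1 yield the statement of
Theorem 1.

Theorem 1 is proved.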